Characterization of Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species

Codonopsis pilosula subsp. tangshen is one of the most important medicinal herbs used in traditional Chinese medicine. Correct identification of materials from C. pilosula subsp. tangshen is critical to ensure the efficacy and safety of the associated medicines. Traditional DNA molecular markers could distinguish Codonopsis species well, so we need to develop super or specific molecular markers. In this study, we reported the plastome of Codonopsis pilosula subsp. tangshen (Oliv.) D.Y. Hong conducted phylogenomic and comparative analyses in the Codonopsis genus for the first time. The entire length of the Codonopsis pilosula subsp. tangshen plastome was 170,672 bp. There were 108 genes in the plastome, including 76 protein-coding genes, 28 transfer RNA (tRNA), and four ribosomal RNA (rRNA) genes. Comparative analysis indicated that Codonopsis pilosula subsp. tangshen had an unusual large inversion in the large single-copy (LSC) region compared with the other three Codonopsis species. And there were two dispersed repeat sequences at both ends of the inverted regions, which might mediate the generation of this inversion. We found five hypervariable regions among the four Codonopsis species. PCR amplification and Sanger sequencing experiments demonstrated that two hypervariable regions could distinguish three medicinal Codonopsis species. Results obtained from this study will support taxonomic classification, discrimination, and molecular evolutionary studies of Codonopsis species.


Introduction
The Campanulaceae family contains 50 genera and approximately 1046 species, mainly found in the north and south temperate regions [1]. Codonopsis is a genus of perennial herbs in the family Campanulaceae. It includes 42 species primarily distributed in East, Central, and South Asia. Among them, 40 are found in China [2]. sequenced the complete plastomes of Codonopsis pilosula subsp. tangshen, to identify super or specific barcode for the discrimination of closely related medicinal Codonopsis species We characterized the genomic features. Then we compared the plastomes from the four Codonopsis species. Lastly, we developed and validated a set of molecular markers to distinguish the four species. The results obtained from this study laid a solid foundation for future taxonomic classification and marker development studies for Codonopsis species. In the following text, Codonopsis pilosula subsp. tangshen and "tangshen" are interchangeable for the sake of easy reading.

Plant material, DNA extraction, and sequencing
To obtain the complete plastome, we collected the fresh leaves of a young tangshen plant from the Huazhong Medicinal Botanical Garden (109˚76' E, 30˚18' N), Enshi, Hubei, China. To validate molecular markers of Codonopsis species, we collected fresh leaves of tangshen, C. lanceolata, and C. tsinlingensis from the Huazhong Medicinal Botanical Garden, Qichun Country, and the Qinling mountain, respectively. The detailed sample information used for sequencing and molecular marker validation is in the S1 Table in S1 File. All samples were collected with permission from the authorities. Then we extracted the genomic DNA with the plant genomic DNA kit (Tiangen Biotech, China). The purity of total DNA was assessed by 1.0% agarose gel electrophoresis. And the concentrations were measured using a Nanodrop Spectrophotometer 2000 (Thermo Fisher Scientific Inc., Waltham, MA, USA). We used the library preparation kit (New England Bio-Labs, America) to construct the DNA library with 1ug DNA. For paired-end library construction, the total DNA was sheared into fragments at approximately 500 bp long. Finally, we sequenced the genomic DNA with a Hiseq 2500 platform (Illumina, San Diego, CA). The remaining sample and DNA were stored in the Institute of Medicinal Plant Development (IMPLAD, accession number: Implad201808044).

Genome assembly and annotation
After obtaining the raw data, we removed the low-quality sequences using Trimmomatic software [30] to get clean data. These low-quality sequences meet the following conditions: (1) having the adaptor sequences; (2) the sequences with more than 50% bases having quality values of Q < 19; and (3) with more than 5% bases being "N." With the development of next-generation sequencing technologies, generating organelle genome assemblies from whole-genome sequencing (WGS) data would be the most accurate and labor-saving method. The plastome of tangshen was de novo assembled using NOVOplasty (v4.0) [31] with the parameter "-t 15, -R 30". We validated the correctness of the assembly by mapping all raw reads to the assembly using BWA [32] with the default settings. The annotation of the plastome was conducted initially using the CPGAVAS2 [33] webserver. The annotation problems were edited by Apollo [34] manually. Then we updated the annotation results by using the "UpdateAnno" module in CPGAVAS2. The cis-splicing and trans-splicing genes of the tangshen plastome were created using CPGview-RSG (http://www.herbalgenomics.org/cpgview). The content of GC was calculated using Editseq from the DNASTAR Lasergene package (v9) [35]. Finally, we submitted the genome sequence and annotations to GenBank and obtained the accession number MW415426.

Repeat and IR regions boundary analysis
Microsatellites are repeating DNA sequences consisting of 1-6 nucleotides (tandem arrays). It is commonly found in the genomes of all prokaryotes and eukaryotes and is called motifs [36].
The microsatellite sequence was analyzed using MISA software [37]. The search parameters were "1-10 2-6 3-5 4-5 5-5 6-5". The numbers before and after the "-" represent the unit size and minimal numbers of repeats, respectively. Then, we analyzed the tandem repeats with Tandem Repeats Finder (TRF) software [38] with the size of the repeat unit � 7. The parameters were "2 7 7 80 10 50 500 -f -d -m". 2,7,7 means weights for the match, mismatch, and indels, respectively; 80 and 10 mean detection parameters, matching probability Pm = 80 and indel probability Pi = 10; 50 means minimum alignment score; 500 represents maximum period size. Last, the dispersed repeats were analyzed using VMATCH software [39]. The search parameters for dispersed repeats were "-f -p -h 3 -l 30". The short explanation: -f: compute maximal forward repeats; -p compute maximal palindromes; -h search for repeats up to the given Hamming distance; -l: specify that repeats must have the given length.
Then we used the online tool IRSCOPE (https://irscope.shinyapps.io/irapp/) to compare the genes on the boundaries of the junction sites of the four plastomes from Codonopsis and four plastomes from closely related species. IRSCOPE is a generic local genomic visualizer tool designed to reflect the scaled genetic structure of plastome sequences over their respective four regions [40]. The size variation of angiosperm plastomes is primarily due to the expansion and contraction of the IR and SSC boundary regions. This analysis provides insight into the evolutionary differences among species in the Codonopsis genus.

PLOS ONE
The Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species tsinlingensis (NC_056284.1) [29], respectively, to identify possible structure variation using the Gepard software [55]. The Mauve [56] software was used to align the sequences to identify intermolecular recombination events of the Codonopsis genus, and C. lanceolata was selected as the reference. The default values were used for all the parameters. To reveal genomic variations, we aligned the four Codonopsis plastomes using the mVISTA program with Shuffle-LAGAN mode [57]. Initially, we did this analysis directly with the plastomes we assembled and the other three Codonopsis species. We found an inversion in the large single-copy (LSC) region from the Mauve results, so we inverted this region of the tangshen plastome and used it as the reference for further analysis.
Through analyzing the Dotplot and Mauve results, we found two repeat sequences at both ends of the inversion. We first extracted the sequences, including the inversion and its flanking repeat sequences, using extractseq [58]. Then we obtained the reverse complemented sequence of one repeat sequence using revseq [58]. Finally, we aligned these two sequences using MAFFT [49]. GeneDoc [59] was used to visualize the results. To determine whether the repeat sequences are also present in the plastomes of the other three Codonopsis species, we used Gepard to get the approximate location of the repeat sequence and then determine its exact locations by BLASTn [60]. Then we extracted the sequences enclosed in these repeat sequences using extractseq [58] and aligned them with MAFFT.

Hypervariable region analysis
To identify the hypervariable regions among the four Codonopsis species, we firstly reversed the inversion region in the tangshen plastome. We wrote a custom script to extract the intergenic spacer regions (IGS) from the GenBank files of the four plastomes. We manually removed those IGS loaded across the boundary and in the inverted region and extracted the IGS sequences using extractseq. Then we aligned the extracted sequences using clustalw2 [61] with options "-type = DNA -gapopen = 10 -gapext = 2". Finally, we calculated the genetic distance of the intergenic regions using the K2p evolution model implemented in the distmat program from the EMBOSS package [58] with the parameters "-nucmethod 2". The threshold value for mapping is 5. It means to visualize the top five results in Fig 7. These five hypervariable regions can be used as molecular markers to distinguish the four Codonopsis species.

Identification and validation of molecular markers for species discrimination
C. minima is a species endemic to Korea [28], and we have not been able to find its cultivation in China. As a result, we could only collect samples from the other three Codonopsis species for molecular marker validation. We designed the primers for the five hypervariable regions using the Primer3 program (http://bioinfo.ut.ee/primer3-0.4.0/). The sequences used to design primers are shown in the S11 and S12 Figs in S1 File. PCR amplifications were performed in a final volume of 50 μL with 25 μL 2 Taq PCR Master Mix, 1 μM of each primer, 1 μL template DNA, and 22 μL ddH2O. All amplifications were carried out in a Pro-Flex PCR system (Applied Biosystems, Waltham, MA, USA) under the following conditions: denaturation at 94˚C for 2 min, followed by 35 cycles of 94˚C for 30 s, at specific annealing temperature (Tm) for 30 s, 72˚C for 60 s and 72˚C for 2 min as the final extension. PCR amplicons were visualized on 1.2% agarose gels and then subjected to Sanger sequencing on an ABI 3730XL instrument (Applied Biosystems, USA) using the same primers used for PCR amplification.
The length of the protein-coding sequence (CDS) in the tangshen plastome is 89,442 bp, representing 52.41% of the entire length. In contrast, the size of rRNA is 9,063 bp, and tRNA is 2,769 bp, representing 5.31% and 1.62% of the total length of the tangshen plastome sequence, respectively. The GC content analysis showed that the overall GC content is 38.15%, whereas GC contents for the CDS, rRNA, and tRNA genes are 38.40%, 54.76%, and 52.87%, respectively. In contrast, the GC contents for the LSC, SSC, and IRs are 36.75%, 32.07%, and 40.31%, respectively. Moreover, a total of 50,591 codons were identified in the tangshen plastome. In total, 64 codons encode 20 amino acids and three termination codons. Among these codons, 5,325 codons encode leucine, and 656 codes encode cysteine, representing the most and least abundant coded amino acids in the tangshen plastome (S3 Table in S1 File).
The IRs are the most conserved regions of the plastome, and their boundary often undergoes contraction and expansion. As shown in Fig 2, the length of LSC, SSC, and IRs differ among the four Codonopsis plastomes. The sizes of IRs in the Codonopsis plastomes range from 37,875 to 38,455 bp. There are eleven genes (rpl22, rps19, rpl2, ndhG, ndhF, psaC, ndhE, ndhG, rpl2, trnH, psbA) at the four junctions: JLB (LSC/IRb), JSB (IRb/SSC), JSA (SSC/IRa) and JLA (IRa/LSC). The JLB junction of the four Codonopsis plastomes was located between the genes rps19 and rpl2. However, the gene rps19 in C. tsinlingensis was further away from the JLB junction compared with those of the other three Codonopsis species.
The ndhE was found in both JSA and JSB junctions. In both C. lanceolata and C. minima, 156 bp of the ndhE gene locates in the SSC, and 147 bp locates in the IRa. In both tangshen and C. tsinlingensis, 145 bp of ndhE genes were located in the SSC, and 158 bp of the genes were located in the IRa. The JLA junctions were located to the right of the rpl2 and the left of the trnH in the four Codonopsis plastomes. The trnH in C. tsinlingensis is further away from the junction.

Repeat analysis
We analyzed three kinds of repeat sequences (microsatellite repeats, tandem repeats, and dispersed repeats) in the tangshen plastome. We identified 30 microsatellite repeats (S4 Table in S1 File), including 19 mononucleotides (A/T), eight dinucleotides (AT/AT), and three trinucleotide repeats (2 AGG/CCT, and 1 AAG/CTT). Among them, 14 microsatellite repeats in the proteincoding regions (rpoC1, rpoC2, cemA, clpP, ycf1, ndhA). There are 40 tandem repeats in the tangshen plastome (S5 Table in S1 File), meeting the two conditions that the length of the repeat unit is more than 30 bp and the similarity among the repeat unit sequences is more than 90%. The

Phylogenetic analysis
To construct the phylogenetic tree, we selected 19 plastome sequences from the genera of Adenophora, Campanula, Codonopsis, Leptocodon, and Platycodon, Cyphia, Lobelia, and Carpodetus (Fig 3). The Carpodetus serratus was selected as the outgroup taxa. The phylogenetic trees were constructed using the maximum likelihood (ML) method and Bayesian Inference (BI) method. The results of both methods had the same topological structure. The 15 species were divided into two main clades. In particular, the Cyphia and Lobelia species formed one clade, and the other 12 species formed another clade. Then Adenophora and Campanula species formed a clade, four

PLOS ONE
The Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species Codonopsis species, one Platycodon species, and one Leptocodon species formed another clade. Interestingly, Leptocodon hirsutus is more closely related to the other three Codonopsis species than Codonopsis tsinlingensis. Therefore, the most closely related species is Leptocodon hirsutus in this phylogenetic tree. The bootstrap support scores for all branches were more than 96 in the tree built by the maximum likelihood (ML) method. The Bayesian inference (BI) posterior probabilities for all branches are 1, showing the high reliability of this tree.

Structure variation of tangshen plastome
The tangshen plastome was compared with those of A. thaliana, C. lanceolata, C. minima, and C. tsinlingensis for structural variations. The Dotplot results showed a high degree of collinearity between the Codonopsis and A. thaliana plastomes (S3-S7 Figs in S1 File). Mauve alignment of plastomes indicated that the tangshen plastome has an unusually large inversion in the LSC region than the other three Codonopsis plastomes (Fig 4). Then we compared the genes in the inversion region of the four Codonopsis species. We found the genes from atpH to rps2 are inverted in the tangshen plastome. The exact positions of this inversion are from 37,849 to 75,896.
Compared with the dispersed repeats shown in S7 Table in S1 File, we found one pair of dispersed repeat sequences flanking the inversion with a length of 164 bp. The alignment of these two repeat sequences is shown in Fig 5A. These two repeat sequences are reverse complementary, forming a palindromic repeat. The repeat unit on the 5' end is from 37,685 to 37,848.

PLOS ONE
The Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species And the repeat unit on the 3' end is from 75,897 to 76,060. We further checked whether this repeat sequence was also present in the other three Codonopsis species. We found this repeat sequence also exists in the C. lanceolata plastome, but the length of 132 bp is shorter (Fig 5B). We did not find any repeat sequences in the plastomes of C. minima and C. tsinlingensis at similar positions. The repeat sequence in the tangshen plastome is likely to be involved in the genesis of the large inversion. However, the repeat sequence did not generate any inversion in the C. lanceolata plastome. We manually inverted the inversion to compare the sequences and used the inverted tangshen plastome as the reference. (Fig 6). The IRs of the four plastomes showed relatively lower sequence similarity than the LSC and SSC regions. Among the four different functional regions: exon, intron, gene, and conserved non-coding sequences (CNS), the CNS regions showed the highest variations. The exon regions of the four plastomes generally exhibited relatively higher conservation than the CNS and intron regions. Still, the two copies of the ycf1 gene showed a higher degree of variations among the four plastomes. In the IRs regions, the rrn23 gene with two copies exhibited the highest similarity among the four plastomes. Overall, the tangshen sequences exhibited relatively higher levels of sequence divergence among the Codonopsis species.

PLOS ONE
The Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species

Development of molecular markers from the plastomes
To discriminate the available three Codonopsis species, we named the DNA marker from the five hypervariable regions rpl36-rps8, rpl14-rpl16, trnL-UAG-ccsA, rps16-trnT-UGU, and

PLOS ONE
The Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species clpP-rpoA as Codonopsis marker 1-5, and Com1, Com2, Com3, Com4, and Com5, in short, respectively. The PCR primers used to amplify these five markers are shown in the S7 Table in S1 File. The Com2 and Com5 failed for PCR amplification. The Com3 sequences amplified from the three species had only one SNP and could not distinguish the three species. The three markers will not be discussed any further.
The product sizes of PCR amplification of Com1, Com3, and Com4 markers from the three Codonopsis species were similar to expected (S8 Fig in S1 File). The DNA fragments were extracted from each band and then subjected to Sanger sequencing. The sequencing results for the PCR products of Com1 and Com4 were identical to the expected sequences (S9 and S10 Figs in S1 File). The marker Com1, derived from the rpl36-rps8 region, has seven specific SNP loci. The first SNP loci shown in red squares can be used to differentiate two of the three Codonopsis species, except C. lanceolata. The second SNP loci, which are shown in red squares, can be used to differentiate tangshen and C. tsinlingensis (Fig 8A, S9 Fig in S1 File). The marker Com4 is derived from the rps16-trnT-UGU IGS region. It has five SNP loci and two Indel loci. Combined with the SNP and Indel loci shown in red squares can be used to distinguish the three Codonopsis species (Fig 8B, S10 Fig in S1 File). We also have tested the new primers on all four available

PLOS ONE
The Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species Codonopsis plastomes obtained from NCBI and this study. These markers can discriminate all four species based on the SNP and Indel loci from Com1 and Com4 (S11 and S12 Figs in S1 File).

Discussion
In this study, we first reported the complete plastome of Codonopsis pilosula subsp. tangshen and then made a systematic comparative analysis of the Codonopsis plastomes. Notably, we have (1) sequenced and assembled the tangshen plastome; (2) compared the four Codonopsis plastomes and found an unusually large inversion having two repeat sequences at both ends of the inversion in the tangshen plastome; (3) identified the top five hypervariable regions for the development of potential molecular markers and validated two of them successfully; (4) carried out the phylogenetic analysis of Codonopsis and its relative genus.

PLOS ONE
The Codonopsis pilosula subsp. tangshen plastome and comparative analysis of Codonopsis species Accurate identification of herbal medicines helps ensure the safe and effective use of herbal medicines. DNA-based markers are suitable for the discrimination of different medicinal plants [62]. Three kinds of DNA barcode markers are of importance: conventional, super, and specific markers. The conventional markers are universal and are applied to all medicinal plants, including ITS2, psbA-trnH, rbcL, and matK. This group of markers can be used as the first line of tools to discriminate samples.
The success of species discrimination of these universal markers is mostly for distantly related species. It generally lacks the discrimination powers for closely related species. In these

PLOS ONE
cases, universal markers with higher-resolution or taxon-specific markers are needed. The complete plastomes have been acclaimed as a super barcode to distinguish related species, especially for taxonomically difficult taxa [26]. The super barcodes demonstrated high discriminative power and sufficient reliability in the previous study.
However, its use may be limited due to the insufficient amount of available DNA to assemble the complete genome, expensive sequencing costs to generate enough raw data to assemble the complete genome, and the complexity of data analysis. Therefore, searching for specific barcodes from hypervariable regions is important as a trade-off between universal and super DNA barcodes [21].
Several successful examples have been reported to develop taxon-specific markers [63]. In the present study, the universal makers have not been successful in discriminating the medicinal Codonopsis species. As a result, we sequence the plastome sequences and developed taxaspecific makers for the discrimination.
In addition to serving as the source of specific DNA barcodes, the plastomes can also be used to understand the phylogenetic relationship of closely related species. The phylogenetic relationship identified in this study is similar to those described previously. Classifications analyses among Codonopsis were previously reported based on four chloroplast gene regions: rbcL, matK, trnH-psbA, and the nuclear internal transcribed spacer (nrITS) [12].
Another application of the complete plastome is the identification of unusual structures that reflect the evolutionary history of the study subject. This study performed an in-depth comparative analysis of the four Codonopsis plastomes. We found a large inversion in the LSC region in the tangshen plastome. Interestingly, we found a pair of palindromic repeat sequences flanking this inversion. A similar repeat sequence was also present in the plastome of C. lanceolata, but the repeat length was 32 bp shorter. Similar repeat sequences were not found in the other two Codonopsis species' plastomes at similar locations. And no inversions were found in the other three Codonopsis plastomes.
We proposed two models for the genesis of the inversion in tangshen plastome. The first evolutionary model is that the repeat sequence found in the four Codonopsis plastomes was acquired before the speciation of the Codonopsis genus. During the differentiation of Codonopsis species, repeat sequences were further differentiated. The tangshen repeat sequence remains active, leading to rearrangement. However, the repeat sequences experience partial or complete deletion in the other three Codonopsis species, preventing the formation of the palindromic structure and the generation of the inversion. The second model is that the repeat sequence was acquired after the Codonopsis speciation independently. From a parsimonious point of view, the second model is less likely to be true. Another interesting question is whether or not the inversion occurred a long time ago. With the availability of more plastomes, we can calculate the percentage of plastomes having and not having the inversions, which might help answer this question.
Inversions mediated by palindromic repeat are not rare in the plastome [64]. The Campanulaceae species mainly consist of perennial herbs and have the most plastome structural variants based on previous studies on enzymatic loci, gene localization, and genome sequencing [2][3][4][5]. For example, previous studies have found numerous structural changes in the Adenophora and Trachelium species [4,6]. Therefore, studying the plastome structure is essential to understanding the phylogenetic relationships and evolutionary history among Campanulaceae species [7].
Previous studies have reported that genomic rearrangements occur due to incorrect recombination of repeat sequences and mispairing of sliding strands [65,66]. So the repeat sequences play a crucial role in plastome rearrangement [67]. For example, the psbA-trnH intergenic region is well known to have a small inversion [68]. This region is frequently used for DNA barcode analysis to distinguish different species [69]. It is generally thought that the palindromic repeat can form a stem-loop structure. Some exonucleases will cut the single-stranded DNA. When the DNA repair system is coming to repair the incised single-stranded DNA, it might connect the wrong DNA strand, forming an inversion (Fig 5A).
One limitation of the current study is the small number of Codonsopsis plastomes analyzed. The plastomes of only four of 42 Codonopsis species have been reported. Consequently, the results obtained from this study will only apply to the four Codonopsis species. Unfortunately, we could collect samples from three of the four species. The C. minima was an endemic species to Korea [28] and could not be found in China. Therefore, the molecular markers validated in this study might only help discriminate between these three species. More plastomes of Codonopsis species are needed in the future to elucidate the taxonomic classification and evolutionary history of the Codonopsis species. It should be pointed out that the three species we have analyzed have medicinal values and are most likely to be used indiscriminately.

Conclusions
We sequenced and analyzed the plastome of tangshen. Based on the four Codonopsis plastomes, we identified and validated two molecular markers using PCR amplification and Sanger sequencing experiments. Comparative analysis showed that the four Conodopsis plastomes have a lower level of genetic diversity. The tangshen plastome has a unique inversion, likely to be formed by repeat-mediated rearrangement.